Loading libraries:
library("RCurl")
library(dplyr)
library(readr)
library(tidyr)
library(stringr)
library(knitr)
library(kableExtra)
library(mice)
library(VIM)
library(e1071)
library(ggplot2)
library(corrplot)
library(plotly)
library(caret)
library(doMC)
library(xgboost)
A listing of the libraries loaded during code execution can be found in the cell below:
## [1] "minqa, class, rio, rstudioapi, prodlim, lubridate, ranger, xml2, codetools, splines, robustbase, zeallot, jsonlite, nloptr, broom, compiler, httr, backports, assertthat, Matrix, lazyeval, htmltools, tools, gtable, glue, reshape2, Rcpp, carData, cellranger, vctrs, nlme, lmtest, timeDate, xfun, gower, laeken, openxlsx, lme4, rvest, lifecycle, pan, DEoptimR, MASS, zoo, scales, ipred, hms, yaml, curl, rpart, stringi, boot, zip, lava, rlang, pkgconfig, evaluate, purrr, recipes, htmlwidgets, tidyselect, plyr, magrittr, R6, generics, mitml, pillar, haven, foreign, withr, survival, abind, sp, nnet, tibble, crayon, car, jomo, rmarkdown, readxl, forcats, ModelMetrics, vcd, digest, webshot, stats4, munsell, viridisLite"
The data is loaded according to the metadata provided in the params object, which specifies the dataset location (both a local path and a URL). The same object also defines how missing data is encoded. In this case the file is read locally if it exists; otherwise it is downloaded from the URL:
if(file.exists(params$dataset_name)) {
  dataset <- read.csv(params$dataset_name, na.strings = params$na_chars)
} else if(url.exists(params$dataset_url)){
  download.file(params$dataset_url, params$dataset_name)
  dataset <- read.csv(params$dataset_name, na.strings = params$na_chars)
} else {
  stop("There is no file nor url resource to work with!")
}
To see the dimensionality we used the dim function:
dim(dataset)
## [1] 52582 16
There are 52582 rows and 16 columns.
Here are the names of all columns available in the raw dataset:
dataset %>%
colnames %>%
print
## [1] "X" "length" "cfin1" "cfin2" "chel1" "chel2" "lcop1"
## [8] "lcop2" "fbar" "recr" "cumf" "totaln" "sst" "sal"
## [15] "xmonth" "nao"
A first look at a few rows of the dataset will give us a better view of the data.
head(dataset)
## X length cfin1 cfin2 chel1 chel2 lcop1 lcop2 fbar recr
## 1 0 23.0 0.02778 0.27785 2.46875 NA 2.54787 26.35881 0.356 482831
## 2 1 22.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 3 2 25.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 4 3 25.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 5 4 24.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 6 5 22.0 0.02778 0.27785 2.46875 21.43548 2.54787 NA 0.356 482831
## cumf totaln sst sal xmonth nao
## 1 0.3059879 267380.8 14.30693 35.51234 7 2.8
## 2 0.3059879 267380.8 14.30693 35.51234 7 2.8
## 3 0.3059879 267380.8 14.30693 35.51234 7 2.8
## 4 0.3059879 267380.8 14.30693 35.51234 7 2.8
## 5 0.3059879 267380.8 14.30693 35.51234 7 2.8
## 6 0.3059879 267380.8 14.30693 35.51234 7 2.8
The names of some columns may be ambiguous. Here is a brief explanation:
## The dataset contains the following attributes:
## - length - length of the fished herring [cm]
## - cfin1 - density of Calanus finmarchicus cat. 1 plankton
## - cfin2 - density of Calanus finmarchicus cat. 2 plankton
## - chel1 - density of Calanus helgolandicus cat. 1 plankton
## - chel2 - density of Calanus helgolandicus cat. 2 plankton
## - lcop1 - density of copepods cat. 1
## - lcop2 - density of copepods cat. 2
## - fbar - intensity of fishing in the region [fraction of fry left behind]
## - recr - annual number of herrings
## - cumf - intensity of annual fishing in the region [fraction of fry left behind]
## - totaln - number of fished herrings
## - sst - sea surface temperature [Celsius]
## - sal - salinity [Knudsen ppt]
## - xmonth - number of the fishing month
## - nao - North Atlantic Oscillation [mean sea level pressure]
After that, the dataset can be converted to a dplyr tbl_df object. By selecting only the columns of numeric type and checking that all column names match those from the raw set, we can make sure that all columns are ready for further work.
herrings <- tbl_df(dataset)
herrings %>%
select_if(is.numeric) %>%
colnames
## [1] "X" "length" "cfin1" "cfin2" "chel1" "chel2" "lcop1"
## [8] "lcop2" "fbar" "recr" "cumf" "totaln" "sst" "sal"
## [15] "xmonth" "nao"
The xmonth column also needs to be converted to a categorical variable, as its values are month numbers.
herrings <- herrings %>%
mutate(xmonth = as.factor(xmonth))
All columns were extracted correctly and contain nothing other than numeric values. However, there are still missing values (NAs) in the data frame. Before deciding what to do with them, it is crucial to know how many rows contain an NA value.
We start by listing the columns that have at least one NA value.
cols_with_na = c()
for(col in colnames(herrings)){
  if(any(is.na(herrings[col]))){
    cols_with_na <- cols_with_na %>% append(col)
  }
}
cols_with_na
## [1] "cfin1" "cfin2" "chel1" "chel2" "lcop1" "lcop2" "sst"
There are 7 such columns in the herrings dataset.
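To see how many rows are affected, the number of incomplete rows can be checked with base R's complete.cases (a small sketch added here for illustration; it is not part of the original output):
# number and share of rows containing at least one NA
sum(!complete.cases(herrings))
mean(!complete.cases(herrings))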
Before making any decision about missing data it is important to visualize it somehow. A common recommendation is not to reconstruct a variable if more than 5% of its values are unavailable. It is also worth knowing the missing-data pattern: if it is random, imputation methods are a good option, but if there appears to be a systematic reason for the data being unavailable, other approaches should be considered (a thorough analysis of the data-gathering process is a good place to start).
The following plots contain interesting information about the missing data. They were produced with VIM’s aggr function on the subset of columns that contain at least one NA value. The left figure, “Histogram of missing data”, shows how many values of each column are missing; for every column it is around 3% of all values. This is below the 5% threshold mentioned above, so data reconstruction is worth considering. Moreover, the fact that all columns have roughly 3% missing data is a first hint that the unavailability might be caused by a random process. The right chart, “Missing data pattern”, shows the pattern of missing data: every combination of missing values found in the dataset is presented. Red symbolizes NA, while blue indicates that a correct value is available. The most common combinations are at the very bottom; moving towards the top, the combinations become less and less frequent.
It seems that there are no strong relations between the missing values. The next step is imputation.
just_with_na <- herrings %>% select(cols_with_na)
aggr_plot <- aggr(just_with_na, col=c('navyblue','red'), numbers=FALSE, bars=TRUE, sortVars=TRUE, labels=names(just_with_na), cex.axis=.8, gap=3, ylab=c("Histogram of missing data","Missing data pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## lcop1 0.03143661
## lcop2 0.03025750
## sst 0.03012438
## cfin1 0.03006732
## chel2 0.02959188
## chel1 0.02957286
## cfin2 0.02921152
In order to reconstruct the missing data we used the mice library (Multivariate Imputation by Chained Equations). The seed of the random generator was set to 17. MICE tries to impute values in place of missing data based on the context provided by the other attributes. One attribute, sal, had to be excluded from this step, as its participation resulted in an error. MICE supports many imputation methods; one could choose between them by first testing their reconstruction quality on a smaller subset of the data and comparing the results with the ground truth. We omitted that step and went straight to the reasonably good pmm method, whose name stands for “Predictive Mean Matching”.
set.seed(17)
imputed <- mice(herrings,predictorMatrix = quickpred(herrings, exclude = c('sal')), meth='pmm', maxit=1)
##
## iter imp variable
## 1 1 cfin1 cfin2 chel1 chel2 lcop1 lcop2 sst
## 1 2 cfin1 cfin2 chel1 chel2 lcop1 lcop2 sst
## 1 3 cfin1 cfin2 chel1 chel2 lcop1 lcop2 sst
## 1 4 cfin1 cfin2 chel1 chel2 lcop1 lcop2 sst
## 1 5 cfin1 cfin2 chel1 chel2 lcop1 lcop2 sst
tmp_data <- complete(imputed,1)
imputed_data = mutate(tmp_data, sal = herrings$sal)
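For completeness, the method-comparison step we skipped could look roughly like the sketch below: a sample of observed sst values is masked, re-imputed with pmm, and compared against the ground truth. This is only an illustration of the idea and was not run as part of the original analysis.
# Sketch: mask 500 known sst values, re-impute them and measure the error
set.seed(17)
check <- as.data.frame(herrings)
masked <- sample(which(!is.na(check$sst)), 500)
truth <- check$sst[masked]
check$sst[masked] <- NA
re_imp <- mice(check, predictorMatrix = quickpred(check, exclude = c('sal')),
               meth = 'pmm', m = 1, maxit = 1, printFlag = FALSE)
recovered <- complete(re_imp, 1)$sst[masked]
sqrt(mean((recovered - truth)^2))   # RMSE of the reconstructed values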
We started studying the attributes by printing their summary, which contains basic statistical measures. There are no missing values, as we got rid of them earlier.
kable_styling(kable(summary(imputed_data)))
| X | length | cfin1 | cfin2 | chel1 | chel2 | lcop1 | lcop2 | fbar | recr | cumf | totaln | sst | sal | xmonth | nao | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0 | Min. :19.0 | Min. : 0.0000 | Min. : 0.0000 | Min. : 0.000 | Min. : 5.238 | Min. : 0.3074 | Min. : 7.849 | Min. :0.0680 | Min. : 140515 | Min. :0.06833 | Min. : 144137 | Min. :12.77 | Min. :35.40 | 8 : 9920 | Min. :-4.89000 | |
| 1st Qu.:13145 | 1st Qu.:24.0 | 1st Qu.: 0.0000 | 1st Qu.: 0.2778 | 1st Qu.: 2.469 | 1st Qu.:13.427 | 1st Qu.: 2.5479 | 1st Qu.:17.808 | 1st Qu.:0.2270 | 1st Qu.: 360061 | 1st Qu.:0.14809 | 1st Qu.: 306068 | 1st Qu.:13.60 | 1st Qu.:35.51 | 10 : 7972 | 1st Qu.:-1.89000 | |
| Median :26290 | Median :25.5 | Median : 0.1111 | Median : 0.7012 | Median : 5.750 | Median :21.435 | Median : 7.0000 | Median :24.859 | Median :0.3320 | Median : 421391 | Median :0.23191 | Median : 539558 | Median :13.86 | Median :35.51 | 7 : 6922 | Median : 0.20000 | |
| Mean :26290 | Mean :25.3 | Mean : 0.4457 | Mean : 2.0228 | Mean :10.003 | Mean :21.204 | Mean : 12.8051 | Mean :28.422 | Mean :0.3304 | Mean : 520366 | Mean :0.22981 | Mean : 514973 | Mean :13.87 | Mean :35.51 | 9 : 5714 | Mean :-0.09236 | |
| 3rd Qu.:39436 | 3rd Qu.:26.5 | 3rd Qu.: 0.3333 | 3rd Qu.: 1.7936 | 3rd Qu.:11.500 | 3rd Qu.:27.193 | 3rd Qu.: 21.2315 | 3rd Qu.:37.232 | 3rd Qu.:0.4560 | 3rd Qu.: 724151 | 3rd Qu.:0.29803 | 3rd Qu.: 730351 | 3rd Qu.:14.16 | 3rd Qu.:35.52 | 6 : 4218 | 3rd Qu.: 1.63000 | |
| Max. :52581 | Max. :32.5 | Max. :37.6667 | Max. :19.3958 | Max. :75.000 | Max. :57.706 | Max. :115.5833 | Max. :68.736 | Max. :0.8490 | Max. :1565890 | Max. :0.39801 | Max. :1015595 | Max. :14.73 | Max. :35.61 | 5 : 3736 | Max. : 5.08000 | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | (Other):14100 | NA |
The figure below presents the correlation matrix (or rather its upper triangle). The length attribute is negatively correlated with sea surface temperature; apart from that, it is only weakly correlated with the other attributes.
cor_data <- select(imputed_data, -c('xmonth', 'X'))
corrplot(cor(cor_data), type='upper', order='hclust', tl.col='black', tl.srt=55, title="Correlation matrix")
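The statement about length can also be verified numerically by inspecting its column of the correlation matrix (a quick sketch, not part of the original report):
# correlations of all attributes with length, sorted from strongest positive
sort(cor(cor_data)[, "length"], decreasing = TRUE)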
The chart below presents how the length of the herrings changed over time. The blue line is the result of a smoothing function; it lets us clearly see the trend present in the data. There is a noticeable breakpoint.
tmp <- imputed_data %>%
dplyr::slice(seq(1, nrow(imputed_data), 50))
p <- ggplot(tmp, aes(x=X, y=length)) +
geom_point(shape=1) + stat_smooth()
p <- ggplotly(p)
p
It is standard practice to check the distributions of the data by plotting histograms and measuring their skewness.
for(col in colnames(imputed_data)){
  if(col != "X"){
    plot <- qplot(imputed_data[[col]], xlab=col) + ggtitle("")
    print(plot)
    print(skewness(imputed_data[[col]]))
  }
}
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] -0.09947805
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 9.029307
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 3.422597
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 3.231852
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 0.4729533
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 2.450582
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 0.8696977
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 0.4447865
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 1.043071
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] -0.01135863
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 0.0624455
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] -0.04888691
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] -0.5374667
## Warning in mean.default(x): argument is not numeric or logical: returning
## NA
## Warning in Ops.factor(x, mean(x)): '-' not meaningful for factors
## [1] NA
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] -0.02178226
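The warning above comes from applying skewness to the categorical xmonth column. Restricting the calculation to numeric columns avoids it (a small sketch, assuming the same imputed_data as above):
# skewness of the numeric columns only, without the factor warning
imputed_data %>%
  select_if(is.numeric) %>%
  summarise_all(skewness)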
Some variables deserve a closer look. For those with skewness greater than 2 a log transform will be used. More analysis could be done on all of the variables, but the authors decided it is sufficient to further investigate only some of them. Depending on the proportion of zeros, either a constant is added before the log transform or a dummy variable is introduced that takes the value 1 if the column was zero and 0 otherwise. This way a more suitable distribution is obtained while the information carried by the zeros is kept.
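This decision rule can be written down compactly as a helper function (a purely illustrative sketch with a hypothetical name, transform_skewed; the per-variable code below is what was actually used):
# Hypothetical helper: log-transform a non-negative, right-skewed column and,
# when zeros are frequent, add a zero indicator instead of shifting by a constant
transform_skewed <- function(x, zero_threshold = 0.05) {
  if (mean(x == 0) > zero_threshold) {
    list(changed = ifelse(x > 0, log(x), 0), zero = as.numeric(x == 0))
  } else {
    list(changed = log(x + 1), zero = NULL)
  }
}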
number_of_zeros = NROW(imputed_data$cfin1[imputed_data$cfin1== 0])
For the variable cfin1 a log transform was used and a dummy variable was introduced, as 28.04% of the values in this column were zero.
print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros: 14745"
imputed_data$cfin1Changed[imputed_data$cfin1 > 0] <- log(imputed_data$cfin1[imputed_data$cfin1 > 0])
imputed_data$cfin1Changed[imputed_data$cfin1 <= 0] <- 0
plot <- qplot(imputed_data$cfin1Changed[imputed_data$cfin1 > 0], xlab="cfin1", binwidth=0.05)
print(plot)
imputed_data$cfin1Zero <- as.numeric(imputed_data$cfin1 == 0)
print(qplot(imputed_data$cfin1Zero,xlab="cfin1Zero"))
number_of_zeros = NROW(imputed_data$cfin2[imputed_data$cfin2== 0])
The same technique was used for the cfin2 variable, as 7.46% of its records were still zero.
print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros: 3925"
imputed_data$cfin2Changed[imputed_data$cfin2 > 0] <- log(imputed_data$cfin2[imputed_data$cfin2 > 0])
imputed_data$cfin2Changed[imputed_data$cfin2 <= 0] <- 0
plot <- qplot(imputed_data$cfin2Changed[imputed_data$cfin2 > 0], xlab="cfin2", binwidth=0.05)
print(plot)
imputed_data$cfin2Zero <- as.numeric(imputed_data$cfin2 == 0)
print(qplot(imputed_data$cfin2Zero,xlab="cfin2Zero"))
number_of_zeros = NROW(imputed_data$chel1[imputed_data$chel1== 0])
For ‘chel1’ there were 3.66% zero values (below 5%), which is why a log transform with an added constant was preferred. The 5% value is not a rule but rather a threshold the authors decided on. A more thorough solution would be to test all possible variable setups.
print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros: 1924"
imputed_data$chel1Changed <- log(imputed_data$chel1 +1)
plot <- qplot(imputed_data$chel1Changed, xlab="chel1", binwidth=0.05)
print(plot)
number_of_zeros = NROW(imputed_data$lcop1[imputed_data$lcop1== 0])
“lcop1” was only log-transformed with an added constant, as it contained no zero values at all.
print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros: 0"
imputed_data$lcop1Changed <- log(imputed_data$lcop1 +1)
plot <- qplot(imputed_data$lcop1Changed, xlab="lcop1", binwidth=0.05)
print(plot)
First we turn xmonth into dummy variables. There is now a new column for each month: zero indicates that the sample was not taken in that month, and one that it was taken in the month specified in the column name.
reg_data <- model.matrix(X ~ ., imputed_data)
head(reg_data)
## (Intercept) length cfin1 cfin2 chel1 chel2 lcop1 lcop2
## 1 1 23.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 2 1 22.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 3 1 25.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 4 1 25.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 5 1 24.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 6 1 22.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## fbar recr cumf totaln sst sal xmonth2 xmonth3
## 1 0.356 482831 0.3059879 267380.8 14.30693 35.51234 0 0
## 2 0.356 482831 0.3059879 267380.8 14.30693 35.51234 0 0
## 3 0.356 482831 0.3059879 267380.8 14.30693 35.51234 0 0
## 4 0.356 482831 0.3059879 267380.8 14.30693 35.51234 0 0
## 5 0.356 482831 0.3059879 267380.8 14.30693 35.51234 0 0
## 6 0.356 482831 0.3059879 267380.8 14.30693 35.51234 0 0
## xmonth4 xmonth5 xmonth6 xmonth7 xmonth8 xmonth9 xmonth10 xmonth11
## 1 0 0 0 1 0 0 0 0
## 2 0 0 0 1 0 0 0 0
## 3 0 0 0 1 0 0 0 0
## 4 0 0 0 1 0 0 0 0
## 5 0 0 0 1 0 0 0 0
## 6 0 0 0 1 0 0 0 0
## xmonth12 nao cfin1Changed cfin1Zero cfin2Changed cfin2Zero chel1Changed
## 1 0 2.8 -3.583439 0 -1.280674 0 1.243794
## 2 0 2.8 -3.583439 0 -1.280674 0 1.243794
## 3 0 2.8 -3.583439 0 -1.280674 0 1.243794
## 4 0 2.8 -3.583439 0 -1.280674 0 1.243794
## 5 0 2.8 -3.583439 0 -1.280674 0 1.243794
## 6 0 2.8 -3.583439 0 -1.280674 0 1.243794
## lcop1Changed
## 1 1.266347
## 2 1.266347
## 3 1.266347
## 4 1.266347
## 5 1.266347
## 6 1.266347
Some attributes in this dataset take values in the tens of thousands while others never exceed one. If the data stayed in that form, some attributes would contribute greatly to the final result while others would be ignored. To make sure that all attributes participate equally in the learning process, the data has to be normalized in one way or another. We decided to use the simple centering and scaling methods from the caret library. Before that, the data has to be divided into training and test parts: the normalization model can be fitted only on the training data, as fitting it on the whole dataset would introduce a data leak.
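An explicit hold-out split could be made with caret's createDataPartition, for example as below (an illustrative sketch with hypothetical object names; in the code that follows, 8-fold cross-validation takes over the role of the held-out evaluation):
# Optional 80/20 split stratified on the target variable
set.seed(17)
in_train <- createDataPartition(reg_data[, "length"], p = 0.8, list = FALSE)
train_set <- reg_data[in_train, ]
test_set  <- reg_data[-in_train, ]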
ctrl <- trainControl(method = "cv", number=8, preProcOptions = c(pcaComp=32), allowParallel = TRUE)
lm <- train(length~., data = reg_data, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'pca'))
lm
## Linear Regression
##
## 52582 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), principal component
## signal extraction (30)
## Resampling: Cross-Validated (8 fold)
## Summary of sample sizes: 46010, 46009, 46009, 46010, 46009, 46008, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.322786 0.359529 1.053329
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
registerDoMC(cores=4)
ctrl <- trainControl(method = "cv", number=8, preProcOptions = c(pcaComp=3), allowParallel = TRUE)
grid <- expand.grid(max_depth=c(3,4,5),
subsample=c(1.0),
nrounds=c(200),
eta=c(0.4, 0.3),
gamma=c(0.001,0.01),
colsample_bytree=c(0.7),
min_child_weight=c(1.0, 0.5))
xgb_tree <- train(length~., data = reg_data, method='xgbTree', trControl=ctrl, tuneGrid=grid, preProcess=c('center', 'scale'))
xgb_tree
## eXtreme Gradient Boosting
##
## 52582 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (8 fold)
## Summary of sample sizes: 46010, 46010, 46010, 46009, 46009, 46008, ...
## Resampling results across tuning parameters:
##
## eta max_depth gamma min_child_weight RMSE Rsquared MAE
## 0.3 3 0.001 0.5 1.143325 0.5215935 0.9005966
## 0.3 3 0.001 1.0 1.142940 0.5219241 0.9004551
## 0.3 3 0.010 0.5 1.143485 0.5214670 0.9006081
## 0.3 3 0.010 1.0 1.143756 0.5212325 0.9009772
## 0.3 4 0.001 0.5 1.138971 0.5252179 0.8962361
## 0.3 4 0.001 1.0 1.139873 0.5244800 0.8971574
## 0.3 4 0.010 0.5 1.139346 0.5249190 0.8968028
## 0.3 4 0.010 1.0 1.139225 0.5250204 0.8963911
## 0.3 5 0.001 0.5 1.139983 0.5244590 0.8965880
## 0.3 5 0.001 1.0 1.140276 0.5242202 0.8969980
## 0.3 5 0.010 0.5 1.139538 0.5248331 0.8961700
## 0.3 5 0.010 1.0 1.139754 0.5246495 0.8963924
## 0.4 3 0.001 0.5 1.142713 0.5220969 0.9000945
## 0.4 3 0.001 1.0 1.142264 0.5224841 0.8995259
## 0.4 3 0.010 0.5 1.141527 0.5231022 0.8986428
## 0.4 3 0.010 1.0 1.141946 0.5227398 0.8987036
## 0.4 4 0.001 0.5 1.139576 0.5247556 0.8965799
## 0.4 4 0.001 1.0 1.140238 0.5242225 0.8967870
## 0.4 4 0.010 0.5 1.140291 0.5241734 0.8969215
## 0.4 4 0.010 1.0 1.140240 0.5242122 0.8969085
## 0.4 5 0.001 0.5 1.141842 0.5229918 0.8974913
## 0.4 5 0.001 1.0 1.141842 0.5230002 0.8976417
## 0.4 5 0.010 0.5 1.141364 0.5233983 0.8972330
## 0.4 5 0.010 1.0 1.141777 0.5230539 0.8975537
##
## Tuning parameter 'nrounds' was held constant at a value of 200
##
## Tuning parameter 'colsample_bytree' was held constant at a value of
## 0.7
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 200, max_depth = 4,
## eta = 0.3, gamma = 0.001, colsample_bytree = 0.7, min_child_weight =
## 0.5 and subsample = 1.
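Since both models were evaluated with 8-fold cross-validation, their resampled results can be compared side by side with caret's resamples helper (a short sketch, not part of the original output):
# collect the cross-validation results of both models and summarise them
res <- resamples(list(lm = lm, xgbTree = xgb_tree))
summary(res)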